Abstract We obtain the conditions for the emergence of the swarm
intelligence effect in an interactive game of restless multi-armed bandit
(rMAB). A player competes with multiple agents. Each bandit has a payoff
that changes with a probability p_c per round. The agents and player
choose one of three options: (1) Exploit (a good bandit), (2) Innovate
(asocial learning for a good bandit among n_I randomly chosen bandits),
and (3) Observe (social learning for a good bandit). Each agent has two
parameters (c, p_obs) to specify the decision: (i) c, the threshold value for
Exploit, and (ii) p_obs, the probability for Observe in learning. The
parameters (c, p_obs) are uniformly distributed. We determine the optimal
strategies for the player using complete knowledge about the rMAB. We
show whether social or asocial learning is more optimal in the
(n_I, p_c) space and define the swarm intelligence effect. We conduct a
laboratory experiment (67 subjects) and observe the swarm intelligence
effect only if (p_c, n_I) are chosen so that social learning is far more optimal
than asocial learning.
§1 Introduction

The trade-off between the exploitation of good choices and the exploration
of unknown but potentially more profitable choices is a well-known problem.
A multi-armed bandit (MAB) provides the most typical environment
for studying this trade-off. It is defined by sequential decision making among
multiple choices that are associated with a payoff. The MAB problem involves
the maximization of the total reward for a given period or budget. In a variety
of circumstances, exact or approximated optimal strategies have been proposed.

Recently, the MAB has also provided a good environment for studying the
trade-off between social and asocial learning. Here, social learning is learning
through observation or interaction with other individuals, and asocial learning
is individual learning. The advantage of social learning is its lower cost
compared with asocial learning. The disadvantage is its error-prone nature, as
the information obtained by social learning might be outdated or inappropriate.
In order to clarify the optimal strategy in an environment with these two
trade-offs, Rendell et al. held a computer tournament using a restless multi-armed
bandit (rMAB). Here, restless means that the payoff of each bandit changes
over time. There are 100 bandits in an rMAB, and each bandit has a distinct
payoff independently drawn from an exponential distribution. The probability
that a payoff changes per round is p_c. An agent has three options for each
round: Innovate, Observe, and Exploit. Innovate and Observe correspond to
asocial and social learning, respectively. For Innovate, an agent obtains the
payoff information of one randomly chosen bandit. For Observe, an agent obtains
the payoff information of n_O randomly chosen bandits that were exploited by
the agents during the previous round. Compared to the information obtained
by Innovate, that obtained by Observe is older by one round. For Exploit, an
agent chooses a bandit that he has already explored by Innovate or Observe and
obtains a payoff. In an rMAB environment, it is extremely difficult for agents to
optimize their choices. The outcome of the tournament was that the winning
strategies relied heavily on social learning. This contradicted previous studies
in which the optimal strategy is a mixed one that relies on some combination of
social and asocial learning. In the tournament, the cost for Observe was not very
low, as approximately 50% of the choices of Observe returned information that
the agents already knew. The results of the tournament imply the inadvertent
filtering of information when an agent chooses Observe, as the agents choose the
best bandit during Exploit.

In this paper, we discuss whether social or asocial learning is optimal in
an rMAB, where a player competes with many agents. We address the question
of why social learning is so adaptive in Rendell’s tournament. We suppose that the
cost of Innovate becomes higher than that of Observe in the tournament. In order
to reduce the cost of Innovate, we control the exploration range n_I for Innovate,
and agents obtain the best information about the bandits among n_I randomly
chosen bandits. An rMAB is characterized by two parameters, p_c and n_I. We
compare the average payoffs of the optimal strategies when only Innovate, only
Observe, and both are available for learning using the complete knowledge of an
rMAB and the information of the bandits exploited by agents. We determine
the region in which each type of learning is optimal in the (n_I, p_c) plane and
show that Observe is more adaptive than Innovate for n_I = 1. We define the
swarm intelligence effect as the increase in the average payoff compared with
the payoffs of the optimal strategies where only asocial learning is available. We
have conducted a laboratory experiment where 67 human subjects competed
with multiple agents in an rMAB. If the parameters are chosen in the region
where social learning is far more optimal than asocial learning, we observe the
swarm intelligence effect.

§2 Restless multi-armed bandit interactive game 

An interactive rMAB game is a game in which a player competes with
120 agents using an rMAB. The player aims to maximize the total payoff over
103 rounds and obtain a high ranking among all entrants. Below, we refer to the
population of all agents and the player as all entrants. The rMAB has N =
100 bandits, and we label them as n ∈ {1, 2, ..., N = 100}. Bandit n has a
distinct payoff s(n), and we term the (n, s(n)) pair bandit information. s(n)
is an integer drawn at random from an exponential distribution (λ = 1; values
were squared and rounded to give integers mostly falling in the range of 0-10).
We denote the probability function for s(n) as Pr(s(n) = s) = P(s) (left
panel of Figure 1). We write the expected value of s(n) as E(s), and it is
approximately 1.68. The payoff of each bandit changes independently between
rounds with a probability p_c, with the new payoff drawn at random from the same
distribution.

Fig. 1 Left: Plot of P(s). The expected value of s is E(s) ≈ 1.68. Right: Parameter
assignment for agent i ∈ {1, 2, ..., 120}. p_obs(i) = 0.1 × (i % 10) ∈ {0.0, 0.1, ..., 0.9}.
c(i) = i/10 + 1 ∈ {1, 2, ..., 12}.

Every entrant has his own repertoire and can store at most three pieces 
of bandit information. The bandit information has a time stamp when the 
entrant obtains it. The time stamp is updated when the entrant obtains new 
bandit information about the bandit. When an entrant obtains more than three 
pieces of bandit information, the one with the oldest time stamp is erased from 
the repertoire. 
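
The repertoire rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the game website: the class name `Repertoire` and its methods `store` and `best` are hypothetical, and ties between equally old time stamps are broken by insertion order.

```python
class Repertoire:
    """At most three pieces of bandit information, keyed by bandit label."""

    CAPACITY = 3

    def __init__(self):
        # bandit label n -> (recorded payoff s, time stamp t)
        self.items = {}

    def store(self, n, s, t):
        # New information about a known bandit overwrites the old entry,
        # which also updates its time stamp.
        self.items[n] = (s, t)
        if len(self.items) > self.CAPACITY:
            # Erase the entry with the oldest time stamp.
            oldest = min(self.items, key=lambda k: self.items[k][1])
            del self.items[oldest]

    def best(self):
        # Entry (n, s, t) with the highest recorded payoff, or None if empty.
        if not self.items:
            return None
        n = max(self.items, key=lambda k: self.items[k][0])
        s, t = self.items[n]
        return n, s, t
```

Keying the entries by bandit label keeps one entry per bandit, so that fresh information about a bandit replaces the stale entry instead of occupying a second slot.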

There are three possible moves for the entrants: Innovate, Observe, 
and Exploit. Innovate and Observe are learning processes to obtain bandit 
information. Exploit is the exploitation process that obtains some payoff. 

• Innovate is individual learning, and an entrant obtains bandit information.
n_I bandits are chosen at random among the N = 100 bandits, and the
bandit information with the maximum payoff is provided to the entrant.
If there are several bandits with the same maximum payoff, one of them
is chosen at random.

• Observe is social learning, and an entrant obtains the bandit information
exploited by an agent during the previous round. If there are many agents
who exploited a bandit, an agent is randomly chosen among them, and
its bandit information is provided to the entrant. If there are no such
agents, no bandit information is provided. The information obtained by
Observe is one round older than that obtained by Innovate.

• Exploit is the exploitation of a bandit. An entrant chooses a bandit from
his repertoire and exploits the bandit. Even if the bandit information is
(n, s(n)), as the information changes with a probability p_c per round, he
does not necessarily receive the payoff s(n).

The repertoire is updated after a move. For Innovate, the bandit
information with the maximum payoff s_I among n_I randomly chosen bandits is
provided to the entrant. We denote the distribution function of s_I as P_I(s) =
Pr(s_I = s). Intuitively, s_I is chosen in the region of upper probability 1/n_I of
P(s). We denote the expectation value of s_I as E(s_I). If n_I > 1, E(s_I) > E(s)
holds. For example, E(s_I) ≈ 9.63 for n_I = 10. By controlling n_I, we can change
the cost of Innovate.
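
The dependence of E(s_I) on n_I can be checked with a small Monte Carlo estimate. The following Python sketch makes two assumptions that the text does not state: "rounded" is taken to mean truncation toward zero, and the n_I bandits are treated as independent draws from P(s). The function names are hypothetical.

```python
import math
import random


def draw_payoff(rng):
    # Exponential variate with rate 1, squared, then truncated to an integer.
    # Truncation (floor) is an assumption about how the rounding was done.
    x = rng.expovariate(1.0)
    return math.floor(x * x)


def innovate_payoff(rng, n_i):
    # Payoffs of distinct bandits are independent draws, so the best of
    # n_i randomly chosen bandits is the maximum of n_i independent draws.
    return max(draw_payoff(rng) for _ in range(n_i))


def estimate_means(n_i, samples=100_000, seed=0):
    # Monte Carlo estimates of E(s) and E(s_I) for a given n_I.
    rng = random.Random(seed)
    mean_s = sum(draw_payoff(rng) for _ in range(samples)) / samples
    mean_s_i = sum(innovate_payoff(rng, n_i) for _ in range(samples)) / samples
    return mean_s, mean_s_i


if __name__ == "__main__":
    mean_s, mean_s_i = estimate_means(n_i=10)
    print(f"E(s) ~ {mean_s:.2f}, E(s_I) ~ {mean_s_i:.2f} for n_I = 10")
```

Under the truncation assumption the estimates should land near the values quoted above; a different rounding rule would shift both means.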

2.1 Agent strategy 

We explain the strategy of the agents. The most important factor in
the performance of the strategies in Rendell’s tournament was the proportion
of Observe in learning. The high performance of Observe originated from
the inadvertent filtering of bandit information, as the agents exploited the best
bandit in their repertoires. If the agents choose at random, Observe does not
provide good bandit information. We take these facts into account and introduce
a simple strategy for the agents with two parameters, c and p_obs.

• c: every agent has a threshold value c. If there is no bandit in one’s
repertoire whose payoff is greater than c, the agent will learn by Innovate
or Observe.

• p_obs: an agent chooses Observe with a probability p_obs when he learns.

We label the 120 agents as i ∈ {1, 2, ..., 120}. Agent i has the
parameters (c(i), p_obs(i)). c(i) is given as the integer quotient of i/10 plus one.
p_obs(i) is the remainder i % 10 multiplied by 0.1. The assignment of (c, p_obs)
to agent i is represented in the right panel of Figure 1.
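
The two-parameter agent strategy can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, and the repertoire is reduced to a plain list of recorded payoffs.

```python
import random


def agent_parameters(i):
    # c(i): integer quotient of i / 10, plus one.
    # p_obs(i): remainder of i modulo 10, multiplied by 0.1.
    c = i // 10 + 1
    p_obs = 0.1 * (i % 10)
    return c, p_obs


def choose_move(repertoire_payoffs, c, p_obs, rng):
    # repertoire_payoffs: recorded payoffs of the bandits in the repertoire.
    # Exploit the best known bandit if its payoff exceeds the threshold c;
    # otherwise learn, using Observe with probability p_obs.
    if repertoire_payoffs and max(repertoire_payoffs) > c:
        return "Exploit"
    return "Observe" if rng.random() < p_obs else "Innovate"
```

For example, `choose_move([8, 11], c=5, p_obs=0.9, rng=random.Random(0))` returns "Exploit", because the best recorded payoff exceeds the threshold.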

2.2 Game environment 

A player participated in a game and competed with the 120 agents. However,
the game did not advance on a real-time basis. Agents had already participated
in the game for 1000 rounds. When a player participated in the game, 103
sequential rounds were randomly chosen from the 1000 rounds, and he competed
with agents for 103 rounds. We denote the round by t ∈ {-2, -1, 0, 1, 2, ..., T =
100}. The scores of the player and agents were set to zero. The agents had
already stored at most three pieces of bandit information in their repertoires.
The player had three rounds to learn the rMAB. He could choose Innovate or
Observe for three rounds and stored at most three pieces of bandit information
in his repertoire. After three rounds, the rMAB game started. As the agents had
already finished the game, they could not observe the information of the player.
On the other hand, the player could observe the information of the agents.

Fig. 2 The interactive rMAB game online interface. A human player is presented with
the present round t/100, his ranking among 121 entrants (one player and 120 agents), and
his repertoire. He must choose one among Innovate, Exploit a bandit, and Observe. In his
repertoire, only (n, s(n)) is shown. The bandit information from left to right indicates the
newest to oldest information, respectively.

The game environment was constructed as a website. The information 
of the agents for 1000 rounds was stored in a database of the website. The player 
used a tablet (7 inch) and participated in the game through a web browser. The 
player had to learn for three rounds and stored at most three pieces of bandit 
information in his repertoire. Afterwards, the game started. Figure 2 shows the
interface of the rMAB game. For the present round t, the ranking and score
are shown on the screen. The player had to choose an action among Innovate, 
Exploit, and Observe. For Exploit, the player had to choose which bandit he 
would exploit in his repertoire. Then, the payoff and new ranking were shown 
on the screen, and the game proceeded to the next round. 
For the parameters (n_I, p_c) of the rMAB, we adopted the following four
combinations. We call the combinations A, B, C, and D.

A: (n_I, p_c) = (1, 0.1). p_c is small, and the payoff of a bandit changes
slowly. As n_I = 1, E(s_I) = E(s), and it is difficult to find a bandit with
a high payoff with Innovate.

B: (n_I, p_c) = (10, 0.1). p_c is small, as in A. As n_I = 10, E(s_I) ≈ 9.63 is
large, and good bandit information can be obtained with Innovate.

C: (n_I, p_c) = (1, 0.2). p_c is large, and the bandit information changes
frequently. As n_I = 1, it is difficult to obtain good bandit information with
Innovate.

D: (n_I, p_c) = (10, 0.2). p_c is large, as in C. As n_I = 10, good bandit
information can be obtained with Innovate.

2.3 Experimental procedure 

The experiment reported here was conducted at the Information Science
room at Kitasato University. The subjects included students from the university, 
mainly from the School of Science. The number of subjects S was 67. Each 
subject participated in the game at most four times. 

The subjects entered a room and sat down on a chair. After listening 
to a brief explanation about the experiment and reward, they signed a consent 
document for participation in the experiment. Afterwards, they logged into the 
experiment website using the IDs written on the consent document. The game 
environment was chosen among the four cases A, B, C, and D, and they started 
their games. After 100 + 3 rounds, the game ended. The subjects logged into
the website again to participate in a new game. Within the allotted time of 
approximately 40 min, most subjects participated in the game at least three 
times. Subjects were paid upon being released from the experiment. 

There were slight differences in the experimental setup and rewards
among the subjects. For the first 21 subjects (July 2014), there was no
participation fee. The reward was completely determined by the number of times
that they entered the Top 20 among the 120 + 1 entrants in each game. Their
reward was a prepaid card of 300 yen (approximately $2.50) for each placement
within the Top 20. The subjects could choose the game environment at the start
of the game. They could choose each environment at most once, and the
average number of subjects in each environment is approximately 19. They did not
know the parameters of each environment. For the last 46 subjects (December
2014), there was a 1050 yen (approximately $9) participation fee in addition to
the performance-related reward. The reward was changed in order to
recruit more subjects. They were asked to play the game at least three times
during the allotted time. The game environment was randomly chosen by the
experimental program. The average number of subjects in each environment is
approximately 37. A total of 67 subjects participated in the experiment, and we
gathered data from approximately 56 subjects for each game environment.

§3 Optimal strategy and swarm intelligence effect

We estimate the expected payoff of the optimal strategies for the player 
in the rMAB game. Here, optimal means to maximize the expected total payoff 
in a total of 100 + 3 rounds. For the first three rounds (t ∈ {-2, -1, 0}), the
player could choose Innovate or Observe. After that, he could choose all three 
options. The optimal choice for round t is defined as the choice that maximizes 
the expected payoff obtained during the remaining T — t rounds. 

We assume that the player has complete knowledge about the rMAB
game. More concretely, he knows p_c, E(s), and E(s_I) for the rMAB.
Furthermore, he knows the bandit information exploited during the previous round. We
denote the average value of the payoff of the bandits exploited at round t - 1
as Ō(t) (written with a bar to distinguish it from the expected payoff of the
Observe move introduced below). If the player chooses Observe for round t, the
expected value of the payoff of the obtained bandit information is Ō(t). Ō(t)
depends on the agents’ choices in the background. It is usually the most difficult
quantity for the player to estimate in the game, as it depends on the strategies of
the agents. With this information, we estimated the expected value of the payoff
per round for the remaining rounds for each choice.

We assume that there are M pieces of bandit information in the player’s
repertoire at round t. We denote them as (n_m, s_m, t_m), m ∈ {1, ..., M}. Here,
t_m is the round during which the player obtained the information. When the
player obtains information from Innovate or obtains updated information from
Exploit at t', t_m = t'. If the player obtains information from Observe at t',
t_m = t' - 1, as Observe returns the bandit information from the previous round,
t' - 1.

We denote the expected value of the payoff per round for exploiting 
bandit n_m from t to T as E_m(t). This quantity is estimated as

\[
E_m(t) = E(s) + \frac{\left(1 - (1 - p_c)^{T-t+1}\right)\,\left(s_m - E(s)\right)\,(1 - p_c)^{t - t_m}}{p_c\,(T - t + 1)}, \tag{1}
\]

where (1 - p_c)^{t' - t_m} is the probability that the bandit information does not change
from t_m until t'. In this case, the payoff at t' is s_m. If the bandit information
changes by t', which occurs with probability 1 - (1 - p_c)^{t' - t_m}, the expected payoff
of the bandit is given by E(s). By summing these values over t' from t to T and
dividing by the number of rounds T - t + 1, we obtain the above expression.
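
The closed form for the expected payoff of Exploit can be checked against the round-by-round sum it is derived from. The following Python sketch does this for arbitrary inputs; the function names are hypothetical, and p_c > 0 is assumed so that the geometric sum can be divided by p_c.

```python
def exploit_value(s_m, t_m, t, T, p_c, mean_s):
    # Closed form: expected payoff per round from exploiting a bandit with
    # recorded payoff s_m (obtained at round t_m) in every round from t to T.
    rounds = T - t + 1
    q = 1.0 - p_c
    geometric = (1.0 - q ** rounds) / p_c
    return mean_s + (s_m - mean_s) * q ** (t - t_m) * geometric / rounds


def exploit_value_direct(s_m, t_m, t, T, p_c, mean_s):
    # Same quantity, summed round by round.
    q = 1.0 - p_c
    total = 0.0
    for t_prime in range(t, T + 1):
        unchanged = q ** (t_prime - t_m)
        # Payoff is still s_m with probability `unchanged`, else E(s) on average.
        total += unchanged * s_m + (1.0 - unchanged) * mean_s
    return total / (T - t + 1)
```

The two functions agree up to floating-point error, which confirms that the prefactor of E(s) is one and that the exponent of the geometric factor is T - t + 1.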

We denote the expected payoff per round for Innovate as I(t). For
Innovate, a player does not receive any payoff. He only obtains bandit information,
and the expected value of the payoff of the obtained bandit information is E(s_I).
We estimate the expected value of the payoff by Innovate by assuming that the
player continues to exploit the new bandit with the payoff E(s_I) from round t + 1
to T as

\[
I(t) = \frac{T - t}{T - t + 1}\,E(s) + \frac{\left(1 - (1 - p_c)^{T-t}\right)\,\left(E(s_I) - E(s)\right)\,(1 - p_c)}{p_c\,(T - t + 1)}. \tag{2}
\]


As the player loses one round because of Innovate, the prefactor in front of E(s)
and the power of (1 - p_c) are reduced to (T - t)/(T - t + 1) and T - t, respectively,
as compared with those in eq. (1). If n_I = 1, E(s_I) = E(s), the second term vanishes, and
Innovate is almost worthless. For cases in which all of the payoffs of the bandit
information in one’s repertoire are zero or less than E(s), it might be optimal to
choose Innovate. Otherwise, instead of losing one round and obtaining bandit
information with a payoff E(s), it is optimal to choose Exploit with the maximum
expected payoff. If p_c is large, even if all the payoffs in one’s repertoire are zero,
(1 - p_c)^{t - t_m} can be negligibly small, and it is optimal to choose Exploit. When
n_I > 1 and p_c is not very large, Innovate might be optimal.

Likewise, we estimate the expected payoff per round for Observe, which
we denote as O(t). For Observe, a player obtains bandit information whose
expected payoff is Ō(t), the average payoff of the bandits exploited at round t - 1.
The age of the information is one round older than the information
obtained by Innovate. We change E(s_I) to Ō(t) in eq. (2). Accounting for the
age of the new bandit information, we estimate O(t) as

\[
O(t) = \frac{T - t}{T - t + 1}\,E(s) + \frac{\left(1 - (1 - p_c)^{T-t}\right)\,\left(\bar{O}(t) - E(s)\right)\,(1 - p_c)^{2}}{p_c\,(T - t + 1)}. \tag{3}
\]

Comparing I(t) and O(t), which is larger depends on p_c and E(s_I) - Ō(t).
If p_c is small and 1 - p_c ≈ 1, the relative magnitude of E(s_I)
and Ō(t) determines which is more optimal.

The optimal strategy is to choose the action with the maximum expected
payoff during every round t ∈ {-2, -1, ..., T = 100}. For example, at t = T,
the last round of the game, as I(T) = O(T) = 0 holds, it is optimal to choose
Exploit for the bandit m with the maximum E_m(T). In the first three rounds, where
the player can choose only Innovate or Observe, if both p_c and n_I are small,
E(s_I) < Ō(t) usually holds, where Ō(t) is the average payoff of the bandits
exploited at round t - 1. Observe is more optimal than Innovate in this
case. The situation is the same in later rounds, and the optimal strategy is a
combination of Exploit and Observe. Conversely, if both p_c and n_I are large,
even if Ō(t) ≈ E(s_I), (1 - p_c) < 1 and I(t) > O(t) hold.
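
The decision rule described above amounts to evaluating the expected payoff per round of every available action and taking the maximum. A minimal Python sketch follows; it assumes p_c > 0, the function names and the `learning_only` flag (used for the three learning rounds) are hypothetical, and `observed_mean` stands for the average payoff of the bandits exploited in the previous round.

```python
def exploit_value(s_m, t_m, t, T, p_c, mean_s):
    # Exploit a bandit, known to pay s_m at round t_m, in every round t..T.
    rounds = T - t + 1
    q = 1.0 - p_c
    return mean_s + (s_m - mean_s) * q ** (t - t_m) * (1.0 - q ** rounds) / (p_c * rounds)


def learning_value(new_payoff, delay, t, T, p_c, mean_s):
    # Spend round t on learning, then exploit the new bandit from t + 1 to T.
    # delay = 1 for Innovate; delay = 2 for Observe, whose information is
    # one round older.
    rounds = T - t + 1
    q = 1.0 - p_c
    gain = (new_payoff - mean_s) * q ** delay * (1.0 - q ** (T - t)) / p_c
    return ((T - t) * mean_s + gain) / rounds


def best_action(repertoire, t, T, p_c, mean_s, mean_s_i, observed_mean,
                learning_only=False):
    # repertoire: list of (n_m, s_m, t_m) tuples.
    values = {
        "Innovate": learning_value(mean_s_i, 1, t, T, p_c, mean_s),
        "Observe": learning_value(observed_mean, 2, t, T, p_c, mean_s),
    }
    if not learning_only:
        for n_m, s_m, t_m in repertoire:
            values[f"Exploit {n_m}"] = exploit_value(s_m, t_m, t, T, p_c, mean_s)
    # Ties are resolved in favor of the first action listed.
    return max(values, key=values.get)
```

At t = T both learning values are zero, so any bandit with a positive expected payoff is exploited, in line with the example in the text.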

We estimate the expected payoff per round for several “optimal” strategies
with a restriction on the choice of learning. We consider three strategies
and an Exploit-only strategy as a control.

• I+O: The player can choose both Innovate and Observe when learning.
In the first three rounds, Innovate is chosen. Then, the action with the
highest expected payoff is chosen in the later rounds.

• I: The player can choose only Innovate for learning. The other conditions are
the same as I+O.

• O: The player can choose only Observe for learning. The other conditions are
the same as I+O.

• EO: The player can only choose Exploit, with the maximum expected payoff,
after the first three rounds.

The expected payoffs per round for these strategies are written as I+O, I, O,
and EO, respectively. We also denote the expected payoff per round for agent i
as P(i). They are estimated by a Monte Carlo simulation. We have performed
a simulation of a game in which 120 agents and four players with the above
strategies participate 10^ times. As we have explained in the experimental procedure,
the agents cannot observe the bandit information exploited by the player. Only
the player can observe the bandit information of the agents. As there is no
interaction between the players, we can estimate the expected payoffs of the four players
simultaneously. In the experiment, the player can choose Observe for the first
three rounds. With the above strategies, the player can choose only Innovate
in these rounds, for simplicity. The players and agents compete on equal terms.

Fig. 3 Optimal learning in the (n_I, p_c) plane. The thick solid line shows the boundary
between the region I > O and the region O > I. The dotted line shows the boundary beyond
which EO ≈ I+O.

We summarize the results in Figure 3. In the (n_I, p_c) plane, we show which
strategy is more optimal, I or O. The thick solid line shows the boundary where I
= O. In the lower-left region, O > I holds. As n_I and p_c are small, the relationship
Ō(t) > E(s_I) holds, where Ō(t) is the average payoff of the bandits exploited in
the previous round, and Observe becomes the optimal learning method. In the
upper-right region, I > O holds: n_I is large, and E(s_I) is greater than or
comparable to Ō(t). As p_c is large, the one-round delay in exploiting the
bandit information obtained by Observe might be crucial. The thin dotted line
shows the boundary beyond which I+O is comparable with EO. As p_c is large,
the player can obtain comparable payoffs by only exploiting a good bandit in
his repertoire. There is neither an exploitation-exploration trade-off nor a
social-asocial learning trade-off above the dotted line. It is a noise-dominant
region.

One can understand why there is no social-asocial learning trade-off in
Rendell’s tournament. In the tournament, they set n_I = 1 and n_O ≥ 1. Here,
n_O is the amount of bandit information obtained by Observe. If n_I = 1, as we
have explained previously, E(s_I) = E(s) holds. If p_c is small, the average payoff
Ō(t) of the bandits exploited in the previous round is usually
greater than E(s), as agents exploit the good bandit in their repertoire. Then,
an agent can obtain good bandit information by Observe, and Observe becomes
an optimal learning method. If p_c is too large, instead of trying to obtain
good information with Innovate, it is optimal to wait for spontaneous changes in
the bandit information in the repertoire. Exploiting a good bandit in one’s
repertoire (EO strategy) is enough, and no other strategy can exceed the
performance of EO.

In the region where O > I and I+O > EO, social learning is effective, and
swarm intelligence can emerge. We define the swarm intelligence effect as the
increase in the performance compared to I. In the next section, we estimate the
swarm intelligence effect for human subjects. As for the choice of (n_I, p_c), we
have studied four cases, A: (1, 0.1), B: (10, 0.1), C: (1, 0.2), and D: (10, 0.2). We show
the positions for these choices in Figure 3. For cases A and C, O >> I, and one
expects to observe the swarm intelligence effect in human subjects. For cases B
and D, where O ≈ I and O < I, respectively, one does not expect to observe it.

We make a comment about the definition of the swarm intelligence effect.
For the estimation of I, we assume that the player knows p_c, E(s), and E(s_I)
and can choose the best option among Exploit and Innovate. In the real game,
the player has to estimate this information from his actions. If the player
cannot choose Observe, his performance cannot exceed I. The definition of the
swarm intelligence effect therefore provides only a lower limit. Toyokawa defined it as
the surplus in performance compared to when the same player can only choose
Innovate. Our definition has the advantage that it can be estimated easily
without performing an experiment. The same reasoning applies to I+O. In this
case, the player knows everything that is related to his decision making. I+O
provides an upper limit on the performance of the player in the game.

§4 Experimental results 

In this section, we explain the experimental results. We estimate the 
swarm intelligence effect for human subjects. We perform a regression analysis 
of the performance of each subject in each experimental environment.

4.1 Swarm intelligence effect 

We calculated the total payoff of each subject for 100+3 rounds in each 
game environment. We divided the total payoff by 100 and obtained the average 
payoff per round. For each game environment, we estimated the average value 
of the average payoffs per round for approximately 56 subjects and denote it as 
H. This represents the average performance of human subjects in each case. We 
compare H with I+O, I, O, and P(i) for agent i. Figure 4 shows the results for
cases A, B, C, and D. We explain the results of each case.

Fig. 4 Plots of P(i) (□), I+O (○), I (●), O (△), and H (×). A: (p_c, n_I) = (0.1, 1), H(54) =
8.0 ± 0.6. B: (p_c, n_I) = (0.1, 10), H(65) = 11.1 ± 0.5. C: (p_c, n_I) = (0.2, 1), H(54) = 5.2 ± 0.3.
D: (p_c, n_I) = (0.2, 10), H(52) = 6.7 ± 0.3. The number in parentheses is the number of
subjects in each case.

A: Case A is in the region where O > I, and Observe is optimal for
learning. As I+O ≈ O and O is much greater than I, one can expect the
swarm intelligence effect. In fact, H, which is plotted with a chain line,
is higher than I. For a fixed value of c, P(i) increases with p_obs. For the
dependence of P(i) on c for a fixed value of p_obs, there is a maximum for
some c. For p_obs = 0.0, P(i) is maximum at c ≈ 4. For p_obs = 0.9, P(i)
is maximum at c ≈ 8.5. The agents can obtain good bandit information
by Observe, so they do better with a large c.

B: Case B is in the region where O > I. However, it is near the boundary
I = O, and the difference between O and I is small. One cannot expect
the swarm intelligence effect. In fact, H is below I. As I ≈ O, subjects
could not improve their performance by Observe. One sees I+O > O, and
the difference between I+O and O is small. As n_I = 10 is large, Innovate
is frequently more optimal than Observe. For example, if an agent finds
by Exploit that the payoff of a good bandit has changed to zero in round t,
one can suspect that Observe will not provide good bandit
information. In particular, if the bandit was good and its payoff was high,
one can assume that many agents also exploited the bandit. Then Observe
should provide bandit information with zero payoff in round t + 1 with a
high probability.

P(i) is an increasing function of p_obs when c is small. When c is large,
P(i) does not depend on p_obs very much. When c is small, agents can
easily obtain bandit information whose payoff is greater than c. Then,
the agent exploits a bandit that is not very good. On the other hand, if the
agent obtains bandit information by Observe, he can obtain good bandit
information, as the bandit’s payoff exceeds the other agents’ c. Observe
is more optimal than Innovate when c is small. However, when c is large,
the agent can obtain good bandit information by Innovate, as n_I is large.
By Observe, the agent can also obtain good bandit information, and there is
not a big difference in the performance of I and O. As a result, P(i) does
not depend very much on p_obs when c is large.

C: Case C is in the region where O > I. I+O ≈ O, and O is much greater than
I, as with case A. One can observe the swarm intelligence effect because
H is greater than I. Because p_c is large, the expected payoffs and average
payoff of the subjects are lower than those for case A.

D: Case D is in the region where I > O, and Innovate is optimal for learning.
As I+O > I, Observe is optimal in some cases. When c is large, P(i)
is a decreasing function of p_obs. As both p_c and n_I are large, instead
of obtaining good bandit information by Observe, Innovate succeeds in
obtaining new and good bandit information. When c is small, as in case B,
the agent can obtain better bandit information by Observe than Innovate.
One cannot observe the swarm intelligence effect, as in case B.

4.2 Regression analysis of the performance of individual 
subjects 

We perform a statistical analysis of the variation in the payoffs of the
subjects in the four cases. We examined the factors that made strategies
successful by using a linear multiple regression analysis. In Rendell’s tournament,
there were five predictors in the best-fit model for the performance of the
strategies. Among them, we considered three predictors: r_learn, the proportion of
moves that involved learning of any kind; r_obs, the proportion of learning moves
that were Observe; and Δt_learn, the average number of rounds between learning moves.
The other predictors were the variance in the number of rounds to first use of
Exploit and a qualitative predictor of whether or not the agent program estimates
p_c. For the latter, we suppose that human subjects estimated p_c, or that they could
notice whether the frequency of the change in bandit information is high or low.
The former predictor is impossible to estimate, as the subjects
participated in the game at most once for each case. We do not include these two
predictors in the regression model. We denote the average payoff per round
for subject j as payoff(j). The multiple linear regression model is written as

payoff(j) = a_0 + a_1 · r_learn(j) + a_2 · r_obs(j) + a_3 · Δt_learn(j).

We select the model that maximizes the fit statistic listed in the last column of
Table 1. The results are summarized in Table 1.
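
The three predictors can be computed from the sequence of moves of a single subject. The Python sketch below is one possible reading, not the analysis code of the experiment: the function name is hypothetical, Δt_learn is taken to be the mean gap in rounds between consecutive learning moves, and the move list in the usage note is invented for illustration.

```python
def predictors(moves):
    # moves: one entry per round, each "Innovate", "Observe", or "Exploit".
    learn_rounds = [k for k, m in enumerate(moves) if m != "Exploit"]
    n_learn = len(learn_rounds)

    # r_learn: proportion of all moves that were learning moves.
    r_learn = n_learn / len(moves)

    # r_obs: proportion of learning moves that were Observe.
    n_obs = sum(1 for m in moves if m == "Observe")
    r_obs = n_obs / n_learn if n_learn else 0.0

    # dt_learn: mean number of rounds between consecutive learning moves
    # (undefined, returned as NaN, when there are fewer than two of them).
    gaps = [b - a for a, b in zip(learn_rounds, learn_rounds[1:])]
    dt_learn = sum(gaps) / len(gaps) if gaps else float("nan")
    return r_learn, r_obs, dt_learn
```

For a hypothetical ten-round sequence with learning moves in rounds 0, 1, 2, 5, and 9, three of which are Observe, this gives r_learn = 0.5, r_obs = 0.6, and Δt_learn = 2.25.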

Table 1 Parameters of the linear multiple regression model predicting the average payoff
per round in each game environment. The second to fifth columns show the intercepts
and the regression coefficients for r_learn, r_obs, and Δt_learn. n.s. for p > 0.05, * for
p < 0.05, ** for p < 10^-2, *** for p < 10^-3, and **** for p < 10^-4.

Case       Intercept      r_learn         r_obs             Δt_learn
A(53)      10.0 (****)    -11.0 (**)      3.3 (p = 0.16)    n.s.                0.253
B(65)      14 1           -10.7 (*)       n.s.              n.s.                0.076
C(54)      8.37 (***)     -8.2 (**)       3.0 (*)           -0.49 (p = 0.06)    0.186
D(52)      g J            -7.2 (**)       1.4 (p = 0.23)    n.s.                0.144
ALL(224)   12.0 (****)    -13.8 (****)    1.3 (p = 0.15)    n.s.                0.246


r_learn had a negative effect on the performance of the subjects, as in
Rendell’s tournament. This result suggests that it is suboptimal to invest too
much time in learning, as one cannot obtain any payoffs while learning. For r_obs,
the results are not consistent with the results of Rendell’s tournament. There,
the predictor had a strong positive effect, which reflected the fact that the best
strategy was to almost exclusively choose Observe rather than Innovate. In our
experiment, the predictor seems to have a positive effect for cases A, C, and
D. For cases A and C, it is consistent with the results in the previous section
because O > I, and Observe is more optimal than Innovate. In case D, as H is
much less than both I and O, obtaining good bandit information from the agents
by Observe might improve the performance.


§5 Conclusion 
In this paper, we attempt to clarify the optimal strategy in an environment
with two trade-offs: the exploitation-exploration trade-off and the
social-asocial learning trade-off. For this purpose, we have developed
an interactive rMAB game, where a player competes with multiple agents.
The player and agents choose an action from three options: Exploit a bandit,
Innovate to obtain new bandit information, and Observe the bandit information
exploited by other agents. The rMAB has two parameters, $p_c$ and $n_I$. $p_c$ is the
probability of a change in the environment, and $n_I$ is the scope of exploration for
asocial learning. The agents have two parameters for their decision making, $p_{obs}$
and $c$. $p_{obs}$ is the probability of choosing Observe when an agent learns, and $c$ is
the threshold value for Exploit.
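
The roles of the four parameters can be made concrete with a minimal sketch. It assumes that an agent exploits whenever the best payoff it knows exceeds its threshold $c$ and otherwise learns, choosing Observe with probability $p_{obs}$; the exact comparison rule, the function names, and the payoff sampler `draw_payoff` are hypothetical and not taken from the game implementation.

```python
import random

EXPLOIT, INNOVATE, OBSERVE = "Exploit", "Innovate", "Observe"

def agent_action(best_known_payoff, c, p_obs, rng):
    """Choose one agent's action for a single round.

    Assumption: Exploit if the best known payoff exceeds the threshold c;
    otherwise learn, using Observe with probability p_obs and Innovate otherwise.
    """
    if best_known_payoff is not None and best_known_payoff > c:
        return EXPLOIT
    return OBSERVE if rng.random() < p_obs else INNOVATE

def refresh_payoffs(payoffs, p_c, draw_payoff, rng):
    """Redraw each bandit's payoff independently with probability p_c.

    draw_payoff is a caller-supplied (hypothetical) sampler for a new payoff.
    """
    return [draw_payoff(rng) if rng.random() < p_c else p for p in payoffs]

rng = random.Random(0)
print(agent_action(best_known_payoff=5, c=3, p_obs=0.5, rng=rng))     # known payoff above c
print(agent_action(best_known_payoff=None, c=3, p_obs=0.5, rng=rng))  # nothing known yet
print(refresh_payoffs([1, 2, 3], 0.1, lambda r: r.randint(1, 10), rng))
```

Keeping the random generator as an explicit argument makes a simulated population of agents reproducible from a single seed.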

We have estimated the average payoff of the optimal strategy with some
restrictions on learning and with complete knowledge about the rMAB and the
bandit information exploited during the previous round. We consider three types
of optimal strategies, I+O, I, and O, in which both Innovate and Observe, only
Innovate, and only Observe are available, respectively. In the $(n_I, p_c)$ plane,
we have determined which strategy is more optimal, O or I. Furthermore, we have
defined the swarm intelligence effect as the surplus of the performance over
that of I. The estimate of the swarm intelligence effect provides only a lower
bound for it; however, the estimation is easy and objective. We also point out
that the swarm intelligence effect can be observed in the region of the
$(n_I, p_c)$ plane where O is more optimal than I. We have performed an experiment
with 67 subjects and have gathered approximately 56 samples for each of the four
cases of $(n_I, p_c)$. When $(n_I, p_c)$ is chosen in the region where O is far more
optimal than I, we have observed the swarm intelligence effect. When $(n_I, p_c)$ is
chosen near the boundary of the two regions or in the region where I is more
optimal than O, we did not observe the swarm intelligence effect. We have
performed a regression analysis of the performance of each subject in each case.
Only the proportion of learning is an effective factor in all four cases. In
contrast, the proportion of the use of Observe for learning is not significant.

Because the agents’ decision-making algorithm is very simple, the conditions
for the emergence of the swarm intelligence effect in Figure 3 are unlikely to
be general. In addition, the analysis of the human subjects is superficial,
as we only studied the correlation between the performance and some predictive
factors. With these points in mind, we make three comments about future
problems.
The first one is a more elaborate and autonomous model of decision
making in an rMAB environment. The algorithm needs to estimate
$p_c$, $E(s)$, $E(s')$, and $O(t)$ for round $t$ on the basis of the data that the agent
has obtained through its choices. Then, the agent can choose the optimal
option during each round and maximize the expected total payoff on the basis
of these estimates. This is an adaptive autonomous agent model. With this
model, we can understand the decision making of humans in the rMAB game
more deeply. It is impossible to understand human decision making completely
from experimental data alone. On the basis of the model, we can detect deviations
in human decision making and propose a decision-making model for humans
that can be tested in other experiments.
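
As one illustration of such an estimate, the following minimal sketch shows how an agent might estimate $p_c$ from its own data. It assumes that the agent exploited the same bandit on consecutive rounds and that any change in the received payoff signals a change of that bandit; the function `estimate_pc` is hypothetical and is not part of the model used in this study.

```python
def estimate_pc(payoff_history):
    """Estimate p_c as the fraction of round-to-round payoff changes.

    payoff_history: payoffs received from ONE bandit exploited on consecutive
    rounds. Returns None when there are fewer than two observations.
    Note: a redraw that happens to return the same payoff is not counted,
    so the estimate is biased slightly downward.
    """
    transitions = len(payoff_history) - 1
    if transitions < 1:
        return None
    changes = sum(1 for prev, cur in zip(payoff_history, payoff_history[1:])
                  if prev != cur)
    return changes / transitions

print(estimate_pc([3, 3, 5, 5, 5]))   # one change in four transitions
```

An adaptive agent could update this estimate every round and use it to decide how long known bandit information remains worth exploiting.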

The second one is the collective behavior of the above adaptive
autonomous agents or humans. It is necessary to clarify how the conditions for the
emergence of the swarm intelligence effect would change. A population of
adaptive autonomous agents would estimate the optimal value of
$p_{obs}$ for the environment $(n_I, p_c)$ and collectively realize it. The
optimal strategy should be neither I nor O but a mixed strategy of Innovate and
Observe. Then, the condition for the emergence of the swarm intelligence effect
is that the performance of I+O is equal to that of I. If the performance of the
former is greater than that of the latter for any $(n_I, p_c)$, the swarm intelligence
effect can always emerge, except for the noise-dominant region. After that, we
can study the conditions with human subjects experimentally. A human subject
may participate in the rMAB game as a player, as in this study, or many human
players may participate in the game and compete with each other. The target is
to understand how and when humans collectively solve the rMAB problem.

The third one is the design of an environment in which swarm
intelligence works. In this study, we choose the rMAB interactive game and study
the conditions for the emergence of the swarm intelligence effect for a player.
However, there are many degrees of freedom in the design of the game. For
example, when an agent observes, there are many ways in which bandit
information can be provided to the agent. In the present game environment,
the probability that a bandit exploited in the previous round is chosen
is proportional to the number of agents who have exploited it. Instead, we can
consider an environment in which the bandit information of the most exploited
bandit is provided, the bandit information of the agents who are near the agent is
provided, or the player chooses a bandit after being shown the number of agents
who have exploited each bandit. We think these changes should affect the choice
and performance of the player. It was shown experimentally that providing
subjective information about a bandit diminished the performance of the
subjects. We think that the interaction between the design of the environment
and the decision making, performance, and swarm intelligence effect should be a
very important problem in the industrial usage of the swarm intelligence effect.
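
Two of the information rules discussed above can be contrasted in a minimal sketch. It assumes that the bandits exploited in the previous round are available as a list with one entry per exploiting agent; the function names and this data layout are hypothetical and not taken from the game implementation.

```python
import random
from collections import Counter

def observe_proportional(exploited_last_round, rng):
    """Present rule: pick a bandit with probability proportional to the
    number of agents who exploited it in the previous round.

    exploited_last_round: one bandit id per exploiting agent, so a uniform
    draw over the list is a count-proportional draw over bandits.
    """
    if not exploited_last_round:
        return None          # nobody exploited, so nothing can be observed
    return rng.choice(exploited_last_round)

def observe_most_exploited(exploited_last_round):
    """Alternative rule: always provide the most exploited bandit."""
    if not exploited_last_round:
        return None
    return Counter(exploited_last_round).most_common(1)[0][0]

rng = random.Random(0)
last_round = [2, 2, 5]       # two agents exploited bandit 2, one exploited bandit 5
print(observe_proportional(last_round, rng))
print(observe_most_exploited(last_round))
```

Under the proportional rule a less popular bandit can still be observed occasionally, whereas the most-exploited rule removes that diversity; this is one way the design of the environment could change what a player learns from Observe.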